Kickstarter is an American public-benefit corporation based in Brooklyn, New York, that maintains a global crowdfunding platform focused on creativity. The company’s stated mission is to “help bring creative projects to life”. For this assignment, we analyze the descriptions of Kickstarter projects to identify commonalities of successful (and unsuccessful) projects using text mining techniques.

Question 1

  1. Identifying Successful Projects
  1. Success by Category

There are several ways to identify the success of a project:

  - State (state): Whether a campaign was successful or not.
  - Pledged Amount (pledged)
  - Achievement Ratio: Create a variable achievement_ratio by calculating the percentage of the original monetary goal reached by the actual amount pledged.
  - Number of backers (backers_count)
  - How quickly the goal was reached (difference between launched_at and state_changed_at) for those campaigns that were successful.

Use one or more of these measures to visually summarize which categories were most successful in attracting funding on Kickstarter. Briefly summarize your findings.

Based on the output above, we can infer that, in terms of average success rate, the differences between most project categories are not very large. However, projects in the Dance, Comics, Publishing, Music and Theater categories seem more likely to succeed.

On the other hand, there are a few clear strong performers when looking at the achievement ratio and the average number of backers. The top-performing categories in terms of average backers count are Games and Design, followed by Technology. In terms of average achievement ratio, Design and Games are in the lead. Other patterns are interesting too. For instance, Comics ranks much higher on average success rate than on average achievement ratio or average backers count.
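The three measures compared above can be illustrated with a minimal Python sketch (the report itself was produced in R, so this is only an illustration of the arithmetic, not the code behind the plots). The field names `goal` and `category` are assumptions, and the records are toy values rather than rows from the Kickstarter data.

```python
from collections import defaultdict

# Toy records for illustration only; not taken from the real data set.
# `goal` and `category` are assumed column names.
projects = [
    {"category": "Games", "state": "successful", "goal": 1000.0,
     "pledged": 2500.0, "backers_count": 120},
    {"category": "Games", "state": "failed", "goal": 4000.0,
     "pledged": 1000.0, "backers_count": 30},
    {"category": "Dance", "state": "successful", "goal": 500.0,
     "pledged": 600.0, "backers_count": 15},
]


def achievement_ratio(pledged, goal):
    """Percentage of the original goal reached by the amount pledged."""
    return 100.0 * pledged / goal


def summarize_by_category(rows):
    """Average success rate, achievement ratio and backers per category."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["category"]].append(row)

    summary = {}
    for category, items in groups.items():
        n = len(items)
        summary[category] = {
            "success_rate": sum(r["state"] == "successful" for r in items) / n,
            "avg_achievement_ratio": sum(
                achievement_ratio(r["pledged"], r["goal"]) for r in items
            ) / n,
            "avg_backers": sum(r["backers_count"] for r in items) / n,
        }
    return summary


summary = summarize_by_category(projects)
```

Because the achievement ratio is unbounded above while the success rate is capped at 1, a handful of heavily overfunded projects can lift a category's average ratio without changing its success rate, which is one reason the rankings can differ between measures.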

BONUS ONLY: b) Success by Location

Now, use the location information to calculate the total number of successful projects by state (if you are ambitious, normalize by population). Also, identify the Top 50 “innovative” cities in the U.S. (by whatever measure you find plausible). Provide a leaflet map showing the most innovative states and cities in the U.S. on a single map based on this information.

## OGR data source with driver: GeoJSON 
## Source: "/Users/samikshya/Desktop/dataviz/gz_2010_us_040_00_500k.json", layer: "gz_2010_us_040_00_500k"
## with 52 features
## It has 5 fields

I first generated longitudes and latitudes for the states and cities, and then merged these data with the GeoJSON spatial data file to build the maps.

Two maps are embedded in the leaflet map, generated from the lists of top states and top cities. One can select the view by city or by state. The radius of each city marker shows the number of projects in that city, and the depth of color of each state shows the number of projects there, i.e. darker blue means more projects in that state. States with no projects are shaded gray.
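The per-state counts behind the shading, together with the optional population normalization mentioned in the task, could be computed along the following lines. This is a Python sketch rather than the R code used for the map; the field name `location_state` and the population figures are hypothetical placeholders, not census values.

```python
from collections import Counter

# Toy records; `location_state` is an assumed field name.
projects = [
    {"location_state": "NY", "state": "successful"},
    {"location_state": "NY", "state": "successful"},
    {"location_state": "NY", "state": "failed"},
    {"location_state": "CA", "state": "successful"},
]

# Made-up round numbers for illustration, not real populations.
population = {"NY": 200_000, "CA": 400_000, "TX": 300_000}


def successes_by_state(rows):
    """Count successful projects per U.S. state."""
    return Counter(
        r["location_state"] for r in rows if r["state"] == "successful"
    )


def successes_per_100k(rows, pop):
    """Successful projects per 100,000 residents; zero where none exist."""
    counts = successes_by_state(rows)
    return {s: 100_000 * counts.get(s, 0) / p for s, p in pop.items()}


counts = successes_by_state(projects)
rates = successes_per_100k(projects, population)
```

States that appear in the population table but have no projects receive a rate of zero here, which corresponds to the gray shading described above.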

Question 2

  1. Writing your success story

Each project contains a blurb – a short description of the project. While not the full description of the project, the short headline is arguably important for inducing interest in the project (and ultimately popularity and success). Let’s analyze the text.

  1. Cleaning the Text and Word Cloud

To reduce the time for analysis, select the 1000 most successful projects and a sample of 1000 unsuccessful projects. Use the cleaning functions introduced in lecture (or write your own in addition) to remove unnecessary words (stop words), syntax, punctuation, numbers, white space etc. Note that many projects use their own unique brand names in upper case, so try to remove these fully capitalized words as well (since we are aiming to identify common words across descriptions). Stem the words left over and complete the stems. Create a document-term-matrix.

Provide a word cloud of the most frequent or important words (your choice which frequency measure you choose) among the most successful projects.

A

For the word cloud, I use the term frequency scores from the corpus created from the blurbs of successful projects.

The second word cloud is generated using the term frequency scores from the corpus created from the blurbs of failed projects. It is included only to show the difference between the two word clouds.
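The cleaning steps and term frequency scores that feed the word clouds can be sketched as follows. This is a simplified Python illustration, not the R pipeline used for the report: the stop word list is a tiny stand-in, stemming and stem completion are omitted, and the blurbs (including the brand name `ACME`) are invented.

```python
import re
from collections import Counter

# Tiny illustrative stop word list; a real analysis would use a full one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "for", "in", "is",
              "with", "your"}


def clean_blurb(text):
    """Return the cleaned tokens of one blurb."""
    # Drop fully capitalized words (likely brand names) before lowercasing.
    text = re.sub(r"\b[A-Z]{2,}\b", " ", text)
    text = text.lower()
    # Keep letters and whitespace only: removes punctuation and numbers.
    text = re.sub(r"[^a-z\s]", " ", text)
    # split() also collapses repeated white space.
    return [w for w in text.split() if w not in STOP_WORDS]


def term_frequencies(blurbs):
    """Total term counts across a collection of blurbs."""
    counts = Counter()
    for blurb in blurbs:
        counts.update(clean_blurb(blurb))
    return counts


blurbs = [
    "ACME is a cooperative board game for 1 to 6 players!",
    "A new album of original music, recorded with your help.",
]
tf = term_frequencies(blurbs)
```

The capitalized-word filter has to run before lowercasing, otherwise the brand names can no longer be told apart from ordinary words.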

B

Provide a pyramid plot to show how the words between successful and unsuccessful projects differ in frequency. A selection of 10 - 20 top words is sufficient here.
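One way to pick the words for such a plot is to take the terms that occur in both groups and rank them by how much their frequencies differ. The Python sketch below shows that selection only; ranking by absolute difference is an assumption, since the report does not state how its words were chosen, and the counts are toy values.

```python
from collections import Counter


def pyramid_data(success_tf, failure_tf, top_n=15):
    """Rows of (word, count in successful, count in unsuccessful).

    Only words present in both groups are kept, ordered by the absolute
    difference in frequency (ties broken alphabetically).
    """
    common = set(success_tf) & set(failure_tf)
    rows = [(w, success_tf[w], failure_tf[w]) for w in common]
    rows.sort(key=lambda r: (-abs(r[1] - r[2]), r[0]))
    return rows[:top_n]


# Toy term frequencies for illustration only.
success_tf = Counter({"game": 10, "music": 4, "art": 6, "film": 1})
failure_tf = Counter({"game": 3, "music": 5, "art": 6, "app": 7})

rows = pyramid_data(success_tf, failure_tf, top_n=2)
```

Each returned row supplies the left bar, the right bar and the center label of one level of the pyramid.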


C

  1. Simplicity as a virtue

These blurbs are short in length (max. 150 characters) but let’s see whether brevity and simplicity still matter. Calculate a readability measure (Flesch Reading Ease, Flesch-Kincaid or any other comparable measure) for the texts. Visualize the relationship between the readability measure and one of the measures of success. Briefly comment on your finding.

Based on the plot, it seems that projects with denser blurbs, i.e. more complex text (as measured by the Flesch-Kincaid grade level), tend to be less successful. Thus, more complex language does not appear to be a virtue when it comes to achieving a higher achievement ratio.
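For reference, the two Flesch measures depend only on the average sentence length and the average number of syllables per word. The Python sketch below applies the standard formulas with a rough vowel-group syllable counter; that counter is a simplifying assumption, so scores will differ somewhat from those of an R readability package.

```python
import re


def count_syllables(word):
    """Rough heuristic: one syllable per group of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def text_stats(text):
    """Return (sentences, words, syllables) for a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        raise ValueError("text contains no words")
    syllables = sum(count_syllables(w) for w in words)
    return max(1, len(sentences)), len(words), syllables


def flesch_reading_ease(text):
    """Higher scores indicate easier text."""
    s, w, syl = text_stats(text)
    return 206.835 - 1.015 * (w / s) - 84.6 * (syl / w)


def flesch_kincaid_grade(text):
    """Approximate U.S. school grade level; higher means harder text."""
    s, w, syl = text_stats(text)
    return 0.39 * (w / s) + 11.8 * (syl / w) - 15.59


simple = "The cat sat on the mat."
dense = "Revolutionary multifunctional architecture facilitates extraordinary collaboration."
```

With blurbs capped at 150 characters, most texts contain one or two sentences, so the syllables-per-word term drives most of the variation in the grade level.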

Question 3

  1. Sentiment

Now, let’s check whether the use of positive / negative words or specific emotions helps a project to be successful.

  1. Stay positive

Calculate the tone of each text based on the positive and negative words that are being used. You can rely on the Hu & Liu dictionary provided in lecture or use the Bing dictionary contained in the tidytext package (tidytext::sentiments). Visualize the relationship between tone of the document and success. Briefly comment.

A

The achievement ratio does not differ much by sentiment. Tone may work both ways: positive words may generate enthusiasm, while negative words may appeal to compassion. Negative words may also appear in blurbs that describe the background of social problems, such as sexual violence or suicide, in order to raise funds for projects that address these problems.
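A dictionary-based tone score of the kind discussed above can be sketched in Python as follows. The formula (positive minus negative, divided by their sum) is one common choice and is an assumption here, since the report does not state which definition it used; the two word sets are tiny stand-ins, not the Hu & Liu or Bing lexicons.

```python
import re

# Tiny illustrative word lists; NOT the Hu & Liu or Bing dictionaries.
POSITIVE = {"love", "beautiful", "fun", "great", "help"}
NEGATIVE = {"violence", "fear", "lost", "problem", "war"}


def tone(text):
    """Tone in [-1, 1]: (positive - negative) / (positive + negative).

    Texts without any dictionary words are scored as neutral (0.0).
    """
    words = re.findall(r"[a-z]+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos + neg == 0:
        return 0.0
    return (pos - neg) / (pos + neg)
```

For example, `tone("A film about war and lost love")` counts one positive and two negative words. Because blurbs are so short, many contain only one or two dictionary words, so the score tends to cluster at -1, 0 and 1, which may contribute to the weak relationship with success.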

B

  1. Positive vs negative

Segregate all 2,000 blurbs into positive and negative texts based on their polarity score calculated in step (a). Now, collapse the positive and negative texts into two larger documents. Create a document-term-matrix based on this collapsed set of two documents. Generate a comparison cloud showing the most-frequent positive and negative words.
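The collapsing step can be sketched in Python as below, starting from blurbs that already carry a polarity score. Dropping blurbs with a polarity of exactly zero is an assumption, and the scored blurbs are toy examples.

```python
from collections import Counter

# Toy (cleaned blurb, polarity) pairs for illustration only.
scored = [
    ("a beautiful fun game", 1.0),
    ("a film about war", -1.0),
    ("fun music for all", 0.5),
    ("an album of songs", 0.0),
]


def collapse_by_polarity(scored_blurbs):
    """Join positive and negative blurbs into two large documents.

    Blurbs with a polarity of exactly zero are left out (an assumption).
    """
    docs = {"positive": [], "negative": []}
    for text, polarity in scored_blurbs:
        if polarity > 0:
            docs["positive"].append(text)
        elif polarity < 0:
            docs["negative"].append(text)
    return {label: " ".join(texts) for label, texts in docs.items()}


def two_document_term_matrix(docs):
    """Map each term to its (positive, negative) document counts."""
    counts = {label: Counter(text.split()) for label, text in docs.items()}
    vocab = sorted(set().union(*counts.values()))
    return {
        term: (counts["positive"][term], counts["negative"][term])
        for term in vocab
    }


docs = collapse_by_polarity(scored)
matrix = two_document_term_matrix(docs)
```

A comparison cloud then sizes each term by how much more frequent it is in one of the two documents than in the other.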

C

  1. Get in their mind

Now, use the NRC Word-Emotion Association Lexicon in the tidytext package to identify a larger set of emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, trust). Again, visualize the relationship between the use of words from these categories and success. What is your finding?

The achievement ratio falls as the number of words in positive categories, such as trust and positive, increases, whereas a larger number of negative-emotion words in the blurb seems to be positively related to the achievement ratio. For instance, disgust and fear are associated with higher achievement ratios.
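The per-category word counts behind this comparison can be sketched in Python as follows. The small dictionary is an invented stand-in, not the NRC lexicon, and a word may belong to several categories at once, as it can in NRC.

```python
import re
from collections import Counter

# Invented stand-in for illustration; NOT the NRC lexicon.
EMOTION_LEXICON = {
    "friend": {"joy", "trust"},
    "horror": {"fear", "disgust"},
    "hope": {"anticipation", "joy", "trust"},
    "monster": {"fear"},
}


def emotion_counts(text):
    """Count how many words of a text fall into each emotion category."""
    counts = Counter()
    for word in re.findall(r"[a-z]+", text.lower()):
        # Words outside the lexicon contribute nothing.
        counts.update(EMOTION_LEXICON.get(word, ()))
    return counts


example = emotion_counts("A horror game with a monster")
```

Fear and disgust words are likely to be common in blurbs for horror-themed games and films, so part of the pattern above may reflect category rather than emotion as such.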